Geetika Rathee & Eline van Elburg

January 13, 2016

This document creates a model object using lm() using Landsat band reflectance values as predictors for tree cover (VCF). Using the resulting model object, VCF values are predicted for the Gewata area.

Set the proper input and output folder

inFolder <- '/home/user/Projects/AssignmentLesson8/data'
outFolder <- '/home/user/Projects/AssignmentLesson8/output'

library(raster)
## Loading required package: sp

Load all the data, put it into a brick and give them proper names

gewata_list <- list.files(inFolder, pattern = glob2rx('*Gewata*.rda'), full.names = TRUE)

for(i in 1:length(gewata_list)){
    load(gewata_list[i])
}

Cut the clouds and water from the VCF file

vcfGewata[vcfGewata > 100] <- NA
alldata <- brick(GewataB1, GewataB2, GewataB3, GewataB4, GewataB5, GewataB7, vcfGewata)
names(alldata) <- c("band1", "band2", "band3", "band4", "band5", "band7", "VCF")
plot(alldata)

# Extract all data to a dataframe
df <- as.data.frame(getValues(alldata))

Make plots for every band against the VCF layer

for(i in 1:6){
    pairs(alldata[[c(i,7)]], maxpixels=10000)
}

Make the linear model. We decided to leave out band 7 because there were some holes in the image.

model <- lm(VCF ~ band1 + band2 + band3 + band4 + band5, data = df)
summary(model)
## 
## Call:
## lm(formula = VCF ~ band1 + band2 + band3 + band4 + band5, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -58.204  -4.636   0.715   5.211 258.586 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.550e+01  5.601e-02  1526.5   <2e-16 ***
## band1        9.291e-02  1.876e-04   495.3   <2e-16 ***
## band2       -1.505e-01  2.314e-04  -650.5   <2e-16 ***
## band3       -2.504e-03  1.473e-04   -17.0   <2e-16 ***
## band4        1.661e-02  2.443e-05   679.9   <2e-16 ***
## band5       -2.004e-02  4.158e-05  -482.0   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.606 on 1808278 degrees of freedom
##   (13712 observations deleted due to missingness)
## Multiple R-squared:  0.8582, Adjusted R-squared:  0.8582 
## F-statistic: 2.189e+06 on 5 and 1808278 DF,  p-value: < 2.2e-16

You can see that the coefficients are significant for each layer. When band 7 is included, the coefficient for band 7 is not significant, another reason why we left it out.

After, predict the VCF map. The linear model does not know that the VCF can only be between 0 and 100 and therefore we transformed all values below 0 to 0.

# Predict the map
vcfMap <- predict(alldata[[1:5]], model = model)

# Change negative values to 0 because VCF can just be between 0 and 100
vcfMap[vcfMap < 0] <- 0
par(mfrow = c(1, 2))
plot(vcfMap, main = "Predicted VCF map")
plot(alldata$VCF, main = "Actual VCF map")

par(mfrow=c(1,1))

The two plots above are the predicted VCF map (left), and the actual VCF map (right). You can see that they are quite similar, however the lower values occur a bit more in the actual plot. This means that probably the land cover class ‘forest’ is easier to predict and therefore more often correct in the predicted map.

Calculate the RMSE and show a histogram of the data.

squared_differences <- (alldata$VCF-vcfMap)^2
plot(squared_differences)

mean_squared = cellStats(squared_differences, stat = 'mean')
RMSE = sqrt(mean_squared)
print(paste("The RMSE is:", RMSE))
## [1] "The RMSE is: 8.44849912661772"

Load training zones and make a raster of the class codes.

load(file.path(inFolder,'trainingPoly.rda'))

# Give the classes a code and make a raster of it
trainingPoly@data$Code <- as.numeric(trainingPoly@data$Class)
trainingPoly@data
##    OBJECTID    Class Code
## 0         1  wetland    3
## 1         2  wetland    3
## 2         3  wetland    3
## 3         4  wetland    3
## 4         5  wetland    3
## 5         6   forest    2
## 6         7   forest    2
## 7         8   forest    2
## 8         9   forest    2
## 9        10   forest    2
## 10       11 cropland    1
## 11       12 cropland    1
## 12       13 cropland    1
## 13       14 cropland    1
## 14       15 cropland    1
## 15       16 cropland    1
classes <- rasterize(trainingPoly, squared_differences, field='Code')

This raster can then be used to perform the zonal statistics and calculate the RMSE for each class.

# Perform zonal statistics to calculate RMSE of each zone
mat_mean <- zonal(squared_differences, classes, fun = 'mean', na.rm=T)
sqrt_vector <- sqrt(mat_mean[,2])
stat_df <- as.data.frame(mat_mean)
stat_df$RMSE <- sqrt_vector
stat_df$class <- c("cropland","forest","wetland")
stat_df
##   zone      mean      RMSE    class
## 1    1  86.33781  9.291814 cropland
## 2    2  24.87072  4.987055   forest
## 3    3 141.92961 11.913421  wetland
worst_zone <- stat_df[which(stat_df$RMSE == max(stat_df$RMSE)), ]
worst_zone
##   zone     mean     RMSE   class
## 3    3 141.9296 11.91342 wetland

You can see in the matrix that zone 1 and 3 (cropland and wetland respectively) have on average a higher RMSE than zone 2 (forest). This was expected when comparing the maps, as forests were usually correctly predicted and the other 2 zones gave more differences. Wetland showed the highest average RMSE.